ABSTRACT

Two modes of evaluation are compared: the summary 
evaluation by supervisors performed at six-month intervals, and the 
technique of direct observation of a clinical encounter through 
one-way glass. The sample consists of 17 residents in pediatrics who 
were evaluated, using both methods, over an eight-month interval. The
analysis of data indicates that the reliability of the direct 
observation technique is acceptable, in contrast to the low 
reliability of the supervisor's assessment. A positive correlation 
exists between evaluations obtained from each method, suggesting that
the two methods are measuring the same behaviors, but the results are 
not significant, probably because of the low reliability of the 
supervisor's assessment. Finally, both methods showed the expected 
change with educational level, with the direct observation scores
displaying a change of two to three times the supervisor's 
assessments. The results indicate that the method of direct 
observation is a more reliable and valid assessment technique than 
the assessment by supervisors. The implications of this conclusion 
are discussed. (Author/BB) 
THE VALIDITY OF DIRECT OBSERVATION
IN ASSESSMENT OF CLINICAL SKILLS

A. Finkel, M.D., and G. R. Norman, Ph.D.

Introduction:

Recent recognition has been given to the inadequacy of present testing
instruments in the assessment of clinical skills. Ample evidence has accumulated
that the traditional information-oriented examination correlates poorly with
subsequent clinical performance. This evidence has led to innovations in the
certification examination, use of formative evaluations as a component of
the certification process, and directives for investigation of new evaluation
techniques at the national level.

Since the majority of these evaluation techniques involve some degree of
simulation of the physician-patient encounter, ranging in fidelity from the use
of one-way glass to observe a workup of the real patient, to the paper-and-pencil
format of the Patient Management Problem, it is essential to examine both the
internal reliability of the method and its external validity. One problem in
establishing the validity of any evaluation of clinical skills is the absence of
any objective measure of clinical competence, and the validity must generally be
inferred from indirect analyses.

In the present paper, we focus on the direct observation of the clinical
workup, using either real or simulated patients. The reliability of the
technique has been established, and preliminary data suggest a positive
correlation with a similar assessment by clinical supervisors. In the present work,
the independent assessment by clinical supervisors, an evaluation mode which has
gained widespread acceptance, will be examined in greater detail, and the
relative validity of the two methods inferred from an analysis of the reliability
of each method, examination of change in evaluations with educational level, and
a correlation between methods.

Materials and Methods:

a) Residents 

Of the 15 Pediatric residents evaluated, eight were first year and seven
were second year residents. Their medical backgrounds varied greatly; most had
graduated from medical schools in their country of origin and had been in Canada
for varying periods of time. The residents spent three-month rotations on
nursery, in-patient or ambulatory services in the Pediatric program.

b) Evaluators

Evaluators were seven general Pediatricians in consulting practice, who
were heavily involved in patient care and served as attending physicians on the
various services in the residency program.

c) Long-Term Assessment

An evaluation form was sent to two or more physicians who had had
the most contact with the resident during his rotation on a service.
The physicians were generally those attending on that service or admitting
patients to that service during the resident's rotation. An accompanying
letter was sent to these physicians asking them to complete the areas that they
felt they had had sufficient contact with the resident to evaluate.

The scores and comments of each evaluator were organized and feedback
delivered through the small tutorial groups in which each resident participated.
Scores were derived by assigning numerical values to the categories on the form.
The evaluations were completed in October 1972 and February 1973, reflecting
on each occasion reports covering the previous three-month rotation.

d) Single Encounter

In this form of evaluation, the resident was observed doing a history and
physical examination on a patient, the observers watching from behind a one-way
viewing screen equipped with audio facilities. Teams of two evaluators who
were general consulting pediatricians were selected. The patients were those
of one of the two evaluators. The patients were fully informed prior to being
seen of the fact that they would be observed and their consent obtained. The
patients selected for junior residents were generally single problems (obesity,
enuresis, abdominal pain) while those selected for senior residents were more
complex (syndromes associated with mental retardation and behavioural disorders,
or complex multi-system diseases).

On two occasions because of patient cancellations it was necessary to use
a programmed patient. An infant or toddler from the adjoining pediatric ward
was picked as the patient. A nurse from the ward was given a prepared history
which coincided with the child's clinical status. The nurse was coached a few
days prior to the evaluations as to how to perform as a programmed mother.
During evaluations involving this patient neither evaluators nor residents had
prior knowledge, nor suspected afterwards, that the nurse was not the child's
real mother.

Instructions to evaluators were as follows:

Evaluators were to compare the resident's performance to that of an expert
pediatrician. They were asked to become familiar with the evaluation form
which outlined specific areas in which the resident's performance was to be
judged. They were given brief summaries of the patients' problems which listed
pertinent negative and positive features of the history and physical and which
included the suggested plan of management. By using this summary and by taking
brief notes while observing the resident the evaluator could compare the history
and physical obtained by the resident to the summarized findings and could see
errors of omission and technique. During observation, the evaluator could score
all parts of the evaluation form except for problem formulation and plan of
management, both of which were scored after the resident presented this information
to the evaluators in the feed-back session following the observation.

The explanations given to the residents prior to this form of evaluation
stressed that the evaluation was to be viewed as an exercise rather than an
examination. The residents were told that the patients they would see would
generally be those of a pediatrician who would be evaluating them. The patient
would be returning for follow-up of a particular problem and the resident's
goal in seeing the patient was to determine the nature of the chronic problem
as well as the current clinical status referable to that problem. It was
explained that the resident would be observed during history and physical
examination and would then be expected to formulate a plan of investigation and
management appropriate for the patient's current problems. The initial part of
the feed-back session with the evaluators would be the resident's presentation
of the patient's problems as he saw them, and his plans for investigation and
management of those problems.

These evaluations took place in October 1972 and March 1973. Two days
were scheduled for the evaluation of 15 residents on each occasion. Patients
were given consecutive one-hour appointments. Evaluator teams usually worked
for one-half day. The evaluation of one resident took place in one hour. The
resident was allowed forty minutes with the parent and child during which he
was monitored by the evaluators in the viewing room. In the next twenty minutes,
the resident met with the evaluators and presented the patient to the evaluators.
This was followed by a discussion of the resident's performance with him by his
evaluators.

Analysis of Data:

Data analysis was directed first to the internal consistency, or reliability,
of each category for the single encounter (SE) and long-term (LT) evaluations,
and second to the validity of each method, examined through the correlations
between SE and LT assessments and the change in evaluations over the
eight-month interval.

As an initial step, distributions of raw scores accumulated over all
categories and all evaluators were plotted as shown in Figure I. From the figure,
it is evident that the SE scores are distributed broadly over the range of possible
values, with a calculated standard deviation of 0.98. By contrast, the LT
estimates follow a much narrower distribution, with 85% of all scores falling
in the range [illegible], and a standard deviation of 0.49. Secondly, 3% of SE
evaluations fell in the "not applicable" category versus 9% of LT estimates.
These results provide a measure of the ability of the instrument to discriminate
levels of performance, and indicate that the SE evaluations have greater
discrimination.

The raw scores were then utilized in a calculation of reliability, using
the method of split halves and the Spearman-Brown formula. Reliability
coefficients for each category are shown in Table I.
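
As an illustrative sketch only (not the authors' computation), the split-half
procedure can be written out as follows. The half-scores are hypothetical, and
treating the two evaluators of each resident as the two "halves" is an
assumption; the Spearman-Brown step-up itself is the standard formula
r_full = 2r / (1 + r).

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to the reliability of the full test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(half_a, half_b):
    """Correlate the two halves, then apply the Spearman-Brown correction."""
    return spearman_brown(pearson_r(half_a, half_b))


# Hypothetical scores for one category: one entry per resident, one list per half.
half_a = [3, 4, 2, 5, 4, 3]
half_b = [2, 4, 3, 5, 3, 3]

print(round(split_half_reliability(half_a, half_b), 2))
print(round(spearman_brown(0.5), 2))  # a half-test correlation of 0.5 steps up to 0.67
```

The correction matters because a correlation between halves understates the
reliability of the full set of ratings; the formula estimates what the
reliability would be if both halves were combined.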

Considering first the SE evaluations, thirteen of the eighteen categories
had reliabilities greater than [illegible]. Two categories, Investigations
and Treatment, had reliabilities of about 0.3, and three categories, Problem
Orientation of History, Priority of Problems, and Disposition, had
reliabilities in the range 0 to 0.1. It is evident that difficulties were
present in assessing problem formulation and management, a result at variance
with previous analyses using this form. The difficulties may be due in part to
the assessment of problem formulation in discussion with the resident rather
than from a written record, the method formerly utilized. The low reliability
of the history category is difficult to rationalize, as other similar categories
had high reliability. The average of all reliabilities was 0.50.

In analyzing the validity of any evaluation of clinical performance, the
basic question to be answered is "How well does the evaluation instrument reflect
the actual performance of the subject physician?" The question is
particularly cogent when applied to the certification process, and one
ramification has been the use of ongoing evaluation, similar to the long-term
assessment of the present work, as a component of the certification grading of the
Royal College of Physicians and Surgeons of Canada. However, the analysis of
reliability in the preceding section would indicate that this method is in
itself sufficiently unreliable to raise questions about the external validity of
such evaluations.

Since, at the present time, there is no independent, reliable measure of
clinical skills with which the SE and LT evaluations could be compared, the
analysis of validity was approached indirectly, by first correlating evaluations
from each method to ascertain if the two methods were assessing the same
characteristics, and by examining the change in evaluations from September to
March (construct validity).

Since the categories assessed in each method were not identical, a first 
step was to group categories, and average scores, in such a way as to develop 
common characteristics. 

Correlation coefficients for the seven grouped categories are shown in
Table II. Six coefficients are positive, but none reach significance at the
[illegible] level. Either of two conclusions may be drawn from these results:
that the different methods are assessing different characteristics, or that the
unreliability of the LT estimates precludes any meaningful comparison with
other measures.

Analysis of the change in evaluations from September to March is shown in
Table II. It will be noted that, although all changes are in the positive
direction, change in the SE estimates is approximately twice the observed
change in LT data. An interesting observation is that the category which least
changes in the SE assessments (G-Patient Interaction) is that which shows
the greatest change in the LT estimates. Since the clinical supervisor rarely
observes the resident in a one-to-one relationship with patients, it is
postulated that the large change in the LT estimates is a reflection of the
supervisor's own greater familiarity with the resident. If this category is
removed from the average, the average change in SE estimates is 0.300, compared
with 0.109 for the LT estimates.
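
The arithmetic behind this comparison can be sketched as follows. The two
averages (0.300 and 0.109) are the figures reported above; the per-category
changes and the category labels other than G are hypothetical and serve only
to show how one category is dropped before averaging.

```python
from statistics import mean


def mean_change_excluding(changes, excluded):
    """Average the per-category changes, leaving out one named category."""
    return mean(v for k, v in changes.items() if k != excluded)


def change_ratio(se_change, lt_change):
    """How many times larger the SE change is than the LT change."""
    return se_change / lt_change


# Hypothetical per-category changes, for illustration only.
hypothetical_lt = {"A": 0.10, "B": 0.12, "G": 0.40}
print(round(mean_change_excluding(hypothetical_lt, "G"), 2))  # 0.11

# Averages reported in the text, with category G removed.
print(round(change_ratio(0.300, 0.109), 2))  # 2.75
```

With category G excluded, the reported SE change is between two and three
times the LT change.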

Discussion of Results:

The analysis of reliability indicates that the SE method results in subjective 
evaluations with a fair degree of reliability. Certain areas, particularly 
problem formulation and management, were inadequate, and may be improved by
assessments based on a written record. Other tactics which may be utilized to
improve the reliability of the data include the development of descriptors to
[illegible] which ensure that certain behaviours [illegible].

By contrast, the long-term estimates were characterized by extremely
variable reliability, primarily because the instrument possessed low
discriminatory power, resulting in a narrow range of scores.

The analysis of validity substantiates these observations, in that the
correlations between SE and LT estimates were low, and not statistically significant.
Furthermore, the expected progression with educational level was present with the
SE and LT estimates, but the average change with SE evaluations was 2-3 times
the change observed in LT assessments.

Conclusions:

The data presented have serious implications at a time when certifying
boards are recognizing formative evaluation as a component of the certification
process. The assessment of clinical skills by supervising faculty, who complete
a form on a periodic basis, has been shown to have little value as an evaluative
instrument, and if formative evaluation is to provide information effectively,
it will be necessary for educators to examine in detail a number of alternate
methodologies for achieving this evaluation. One alternative is the single
encounter assessment, which, although retaining the liabilities of subjective
evaluation, appears to be more reliable than the supervisor's assessment.

It is intended to repeat this analysis in the near future, using a larger
sample of about fifty residents in internal medicine, and examining the reliability
of the assessment form used by the Royal College of Physicians and Surgeons of
Canada. If the results of this study are verified with the larger sample, it
will be the task of medical educators to develop and test alternative evaluation
modalities, such as the single encounter assessment, and to examine possible ways
in which the reliability of these methods can be improved.